Using Parallel Corpora to enrich Multilingual Lexical Resources

نویسندگان

Dominic Widdows

Beate Dorow

Chiu-Ki Chan

چکیده

This paper describes the use of a bilingual vector model for the automatic discovery of German translations of English terms. The model is built by analysing co-occurence patterns in a parallel corpus of English and German medical abstracts, a method also used for CrossLingual Information Retrieval. The model generates candidate German translations of English words using the cosine similarity measure between terms in the bilingual vector space. The correct translations could be added to UMLS, the multilingual dictionary in question. The accuracy of the translations is evaluated by measuring how many of the existing UMLS translations are correctly predicted by the vector translations. The model also detects synonymy, particularly acronyms. An online public demonstration of the model is available.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Multilingual Lexical Database Generation from parallel texts with endogenous resources

This paper deals with multilingual database generation from parallel corpora. The idea is to contribute to the enrichment of lexical databases for languages with few linguistic resources. Our approach is endogenous: it relies on the raw texts only, it does not require external linguistic resources such as stemmers or taggers. The system produces alignments for the 20 European languages of the ‘...

متن کامل

Multilingual Lexical Database Generation from Parallel Texts in 20 European Languages with Endogenous Resources

متن کامل

That'll Do Fine!: A Coarse Lexical Resource for English-Hindi MT, Using Polylingual Topic Models

Parallel corpora are often injected with bilingual lexical resources for improved Indian language machine translation (MT). In absence of such lexical resources, multilingual topic models have been used to create coarse lexical resources in the past, using a Cartesian product approach. Our results show that for morphologically rich languages like Hindi, the Cartesian product approach is detrime...

متن کامل

News from OPUS — A Collection of Multilingual Parallel Corpora with Tools and Interfaces

The opus corpus is a growing resource providing various multilingual parallel corpora from different domains. In this article we introduce resources that have recently been added to opus. We also look at some corpus-specific problems and the solutions used in preparing the parallel data for the inclusion in our collection. In particular, we discuss the alignment of movie subtitles and the conve...

متن کامل

Slovene-English Datasets for MT

Advances in machine translation are becoming increasingly dependent on the availability of large scale language resources, in particular parallel corpora. The talk presents Slovene-English language resources that were developed as datasets for translation studies and machine learning programs. Three parallel datasets are introduced: the MULTEXT-East multilingual word-annotated corpus, the IJS-E...

متن کامل

ذخیره در منابع من

ذخیره در منابع من قبلا به منابع من ذحیره شده

{@ msg_add @}

با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره شماره

صفحات -

تاریخ انتشار 2002

Using Parallel Corpora to enrich Multilingual Lexical Resources

نویسندگان

چکیده

منابع مشابه

Multilingual Lexical Database Generation from parallel texts with endogenous resources

Multilingual Lexical Database Generation from Parallel Texts in 20 European Languages with Endogenous Resources

That'll Do Fine!: A Coarse Lexical Resource for English-Hindi MT, Using Polylingual Topic Models

News from OPUS — A Collection of Multilingual Parallel Corpora with Tools and Interfaces

Slovene-English Datasets for MT

عنوان ژورنال:

اشتراک گذاری